TECHNICAL TEST: GRUPO BANCOLOMBIA

Problem description

The company processes hundreds of thousands of customer documents that must be registered on our platform. To lighten our operational load, we need a solution that automates both the upload of these documents and their analysis.

Installing the required libraries

Boto3 - the AWS SDK for Python
Thefuzz - address matching through fuzzy string comparison
In [1]:
!pip install boto3
!pip install thefuzz
!pip install nbconvert
!pip install pyppeteer

Imports

We use the following libraries: boto3, googlemaps, pandas, plotly.express, os, and thefuzz.

In [2]:
import os

import boto3
import googlemaps
import pandas as pd
import plotly.express as px
from botocore.exceptions import NoCredentialsError  # caught in the upload loop
from thefuzz import fuzz, process

We set our configuration variables. Credentials are read from environment variables so that no secret is stored in the notebook.

In [3]:
# Secrets come from the environment; the variable name GMAPS_KEY is an assumption.
AWS_ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']
AWS_SECRET_ACCESS_KEY = os.environ['AWS_SECRET_ACCESS_KEY']
BUCKET_NAME = 'groupbtest'
FOLDER_PATH = './Untitled Folder'
REMOTE_FILE_NAME = 'Prueba.txt'
GMAPS_KEY = os.environ['GMAPS_KEY']
df = pd.DataFrame(columns=['Direccion'])

Connecting to AWS S3

In [4]:
s3 = boto3.client(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
)

We list all the files in our local folder and upload each one to the connected S3 bucket.

In [5]:
file_names = os.listdir(FOLDER_PATH)

for file_name in file_names:
    try:
        # upload_file(local path, bucket, key): the key reuses the local file name.
        s3.upload_file(f'{FOLDER_PATH}/{file_name}', BUCKET_NAME, file_name)
        print(f'{file_name} was uploaded successfully to {BUCKET_NAME} as {file_name}')
    except FileNotFoundError:
        print(f'The file {FOLDER_PATH}/{file_name} was not found')
    except NoCredentialsError:
        print('No AWS credentials were found')
Dir1.txt was uploaded successfully to groupbtest as Dir1.txt
Dir2.txt was uploaded successfully to groupbtest as Dir2.txt
Dir3.txt was uploaded successfully to groupbtest as Dir3.txt
Dir4.txt was uploaded successfully to groupbtest as Dir4.txt
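
The upload loop above could also be wrapped in a small function that receives the client as a parameter, which makes it easy to try out without touching AWS. This is only a sketch: the name upload_folder is hypothetical and not part of the original notebook, and it assumes the client exposes boto3's upload_file(Filename, Bucket, Key) method. It also skips subdirectories, which os.listdir returns alongside regular files.

```python
import os


def upload_folder(client, folder_path, bucket_name):
    """Upload every regular file in folder_path and return the uploaded keys.

    Hypothetical helper: `client` is expected to behave like a boto3 S3 client.
    """
    uploaded = []
    for file_name in sorted(os.listdir(folder_path)):
        local_path = os.path.join(folder_path, file_name)
        # os.listdir also lists subdirectories, which cannot be uploaded as files.
        if not os.path.isfile(local_path):
            continue
        # The object key reuses the local file name, as in the loop above.
        client.upload_file(local_path, bucket_name, file_name)
        uploaded.append(file_name)
    return uploaded
```

In the notebook this would be called as upload_folder(s3, FOLDER_PATH, BUCKET_NAME).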

We create a function that generates the alternative spellings of each address found in our plain-text files.

In [6]:
def direccionesAlternas(direccion):
    # direccion is the address already split on spaces. Index 1 and index 3
    # hold the first two address components. The last component is at index 5
    # when index 4 is a '-' separator, and at index 4 otherwise.
    via, cruce = direccion[1], direccion[3]
    placa = direccion[5] if direccion[4] == '-' else direccion[4]

    # One candidate per street-type prefix and number-sign spelling.
    return [
        f'Carrera {via} # {cruce} {placa}',
        f'Carrera {via} Nro {cruce} - {placa}',
        f'Carrera {via} Numero {cruce} - {placa}',
        f'Carrera {via} Num {cruce} - {placa}',
        f'Kra {via} Num {cruce} - {placa}',
        f'Calle {via} Num {cruce} - {placa}',
        f'Trasversal {via} Num {cruce} - {placa}',
    ]
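
The function above relies on fixed token positions, so it fails on addresses with fewer tokens or with the separator glued to a number. As an illustration only, the sketch below extracts the same three components more defensively. The name parse_address and the sample input formats are assumptions, since the notebook does not show the contents of the source files.

```python
def parse_address(text):
    """Return (via, cruce, placa) from an address string.

    Hypothetical helper; assumes inputs shaped like 'Cra 70 # 26A - 33',
    'Cra 70 # 26A-33' or 'Cra 70 # 26A 33'.
    """
    # Surround '-' with spaces so '26A-33' and '26A - 33' tokenize identically.
    tokens = text.replace('-', ' - ').split()
    if len(tokens) >= 6 and tokens[4] == '-':
        placa = tokens[5]
    elif len(tokens) >= 5 and tokens[4] != '-':
        placa = tokens[4]
    else:
        raise ValueError(f'Unexpected address format: {text!r}')
    return tokens[1], tokens[3], placa
```

Raising ValueError on malformed input keeps a bad file from surfacing later as an IndexError inside the variant generator.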

We define a function that computes the similarity percentage using fuzzy string matching from the thefuzz library.

In [7]:
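def calcularPorcentaje(diroriginal):
    # Build the candidate spellings and rejoin the original tokens into one string.
    direcciones = direccionesAlternas(diroriginal)
    temporal = " ".join(diroriginal)

    for direccionFictica in direcciones:
        # partial_ratio returns a similarity score between 0 and 100.
        ratio = fuzz.partial_ratio(temporal.lower(), direccionFictica.lower())
        if ratio >= 90:
            # Keep only the candidates that score 90 or higher.
            df.loc[len(df)] = direccionFictica.strip()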
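
To give an intuition for what a partial ratio measures, the sketch below approximates the idea with the standard library's difflib: the shorter string is compared against every same-length window of the longer one, and the best score wins. This is not thefuzz's implementation and its scores will not match fuzz.partial_ratio exactly; the function names are hypothetical.

```python
from difflib import SequenceMatcher


def ratio_sketch(a, b):
    """Whole-string similarity as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio_sketch(a, b):
    """Best similarity between the shorter string and any equally long
    window of the longer string (a simplified stand-in for partial_ratio)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    window = len(shorter)
    best = 0
    for start in range(len(longer) - window + 1):
        best = max(best, ratio_sketch(shorter, longer[start:start + window]))
    return best
```

Because an exact substring always yields a window that matches perfectly, the partial score reaches 100 in that case even when the whole-string score is much lower. That is why a partial comparison suits addresses that carry extra text around the part of interest.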

We request the list of objects in the bucket and read each file to extract its content. For every address we generate the alternative spellings with direccionesAlternas and compute the similarity score; spellings that score 90 or higher are added to the DataFrame.

In [8]:
response = s3.list_objects_v2(Bucket=BUCKET_NAME)

for obj in response.get('Contents', []):
    try:
        # Use a separate name so the listing response is not overwritten.
        obj_response = s3.get_object(Bucket=BUCKET_NAME, Key=obj['Key'])
        data = obj_response['Body'].read()
        data_str = data.decode('utf-8').strip()
        # The whole file content is treated as one address and split into tokens.
        diroriginal = data_str.split(' ')
        calcularPorcentaje(diroriginal)
    except Exception as e:
        print(f"An error occurred: {e}")
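
A single list_objects_v2 call returns at most 1,000 keys, so a bucket holding the volume of documents described in the problem statement would need pagination. The sketch below follows the IsTruncated and NextContinuationToken fields of the response; the function name iter_object_keys is hypothetical and the code is not part of the original notebook.

```python
def iter_object_keys(client, bucket_name):
    """Yield every object key in the bucket, one page at a time.

    Hypothetical helper: `client` is expected to behave like a boto3 S3 client.
    """
    kwargs = {'Bucket': bucket_name}
    while True:
        page = client.list_objects_v2(**kwargs)
        for obj in page.get('Contents', []):
            yield obj['Key']
        # IsTruncated is true while more pages remain.
        if not page.get('IsTruncated'):
            break
        kwargs['ContinuationToken'] = page['NextContinuationToken']
```

The reading loop could then iterate with for key in iter_object_keys(s3, BUCKET_NAME) instead of reading response.get('Contents', []) directly.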

We print the DataFrame

In [9]:
print(df)
             Direccion
0  Carrera 70 # 26A 33
1  Carrera 70 # 26A 80
2   Carrera 70 # 70 88
3   Carrera 70 # 86 33

We initialize gmaps_key, the Google Maps client, with our API key

In [10]:
gmaps_key = googlemaps.Client(key = GMAPS_KEY)

We add the columns LAT, LON, Color, and Tamaño to our DataFrame. LAT and LON hold the coordinates that the API returns for each address. Color and Tamaño are used only to draw the map.

In [11]:
df['LAT'] = None
df['LON'] = None
df['Color'] = 1
df['Tamaño'] = 1

for i in range(len(df)):
    geocode_result = gmaps_key.geocode(df.iat[i, 0])
    try:
        # Take the coordinates of the first match returned by the API.
        location = geocode_result[0]["geometry"]["location"]
        df.iat[i, df.columns.get_loc("LAT")] = location["lat"]
        df.iat[i, df.columns.get_loc("LON")] = location["lng"]
    except (IndexError, KeyError):
        # No usable geocoding result: leave LAT and LON as None for this row.
        pass
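
The coordinate lookup inside the loop can be isolated in a small function, which makes the handling of an empty or incomplete result explicit. This is a sketch under the assumption that the geocoding result is a list of dictionaries with the geometry/location/lat/lng nesting used above; extract_lat_lng is a hypothetical name.

```python
def extract_lat_lng(geocode_result):
    """Return (lat, lng) from the first geocoding match, or (None, None).

    Hypothetical helper mirroring the lookup done in the loop above.
    """
    try:
        location = geocode_result[0]['geometry']['location']
        return location['lat'], location['lng']
    except (IndexError, KeyError, TypeError):
        # Empty list, missing field, or a None result: no coordinates available.
        return None, None
```

With this helper the loop body would reduce to lat, lon = extract_lat_lng(gmaps_key.geocode(df.iat[i, 0])) followed by the two assignments.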

We view the final version of the DataFrame

In [12]:
df
Out[12]:
             Direccion       LAT         LON  Color  Tamaño
0  Carrera 70 # 26A 33  6.228612  -75.591616      1       1
1  Carrera 70 # 26A 80  6.228965  -75.591167      1       1
2   Carrera 70 # 70 88  6.216905  -75.592414      1       1
3   Carrera 70 # 86 33   6.23553  -75.591561      1       1

We draw the map with Plotly from the DataFrame and its values

In [13]:
fig = px.scatter_mapbox(df, lon='LON', lat='LAT', color='Color', size='Tamaño',
                        zoom=10, width=900, height=600, title='DIRECTIONS MAP')
fig.update_layout(mapbox_style="open-street-map",
                  margin={"r": 0, "t": 50, "l": 0, "b": 10})
fig.show()